(Lin et al. 2018) (Durinck et al. 2009) (Durinck et al. 2005) (Davis and Meltzer 2007) (Huber et al. 2015) (R Core Team 2022) (Sanghi, Gruber, and Metwally 2021) (Law et al. 2016) (Robinson, DJ, and Smyth 2010) (McCarthy, Y, and Smyth 2012) (Chen, Lun, and Smyth 2016) (McCarthy, Y, and Smyth 2012) (Morgan 2021) (Kolberg et al. 2020) (Gu, Eils, and Schlesner 2016)

Introduction

The data set I selected from GEO (Gene Expression Omnibus) was “APOE4 Causes Widespread Molecular and Cellular Alterations Associated with Alzheimer’s Disease Phenotypes in Human iPSC-Derived Brain Cell Types” conducted by the lab in MIT on Illumina HiSeq 2000 palteform with GEO identifier GSE102956. This research compares the three controls(APOE3) and three tests(APOE4) iPSC-Derived Brain Cell to find whether APOE4 variant of APOE gene will cause Alzheimer’s Disease. In the previous assignment, the data set was first cleaned by checking duplication and removing the low gene counts and outliers. Then the mixed NCBI gene ids and HGNC ids are merged and mapped into HGNC id. Finally, the data was normalized by TMM to be better used in the future. 

Packages prep

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    
if (!requireNamespace("GEOmetadb", quietly = TRUE))
    BiocManager::install("GEOmetadb")

if (!requireNamespace("GEOquery", quietly = TRUE))
    BiocManager::install("GEOquery", force = TRUE)

if (!requireNamespace("knitr", quietly = TRUE))
    BiocManager::install("knitr")

if (!requireNamespace("edgeR", quietly = TRUE))
    BiocManager::install("edgeR")

if (!requireNamespace("biomaRt", quietly = TRUE))
    BiocManager::install("biomaRt")

if (!requireNamespace("magrittr", quietly = TRUE))
    BiocManager::install("magrittr")

if (!requireNamespace("kableExtra", quietly = TRUE))
    BiocManager::install("kableExtra")

if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
    BiocManager::install("ComplexHeatmap")

if (!requireNamespace("circlize", quietly = TRUE))
    BiocManager::install("circlize")

if (!requireNamespace("limma", quietly = TRUE))
    BiocManager::install("limma")

if (!requireNamespace("gprofiler2", quietly = TRUE))
    BiocManager::install("gprofiler2")


library("BiocManager")
library("GEOmetadb")
library("GEOquery")
library("knitr")
library("edgeR")
library("biomaRt")
library("magrittr")
library("kableExtra")
library("ComplexHeatmap")
library("circlize")
library("limma")
library("gprofiler2")

Differential Gene Expression

Load normalized data from A1.

normalized_count_data <- read.table(file=file.path(getwd(), "data", "nomolized_gene"), header = TRUE,sep = ",",stringsAsFactors = FALSE,check.names=FALSE)

colnames(normalized_count_data)[1] <- "GeneID"

Create a heat map to visualize our data.

heatmap_matrix <- normalized_count_data[, 2:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- normalized_count_data$GeneID
head(heatmap_matrix)
##            NEU_E3.1  NEU_E3.2   NEU_E3.3  NEU_E4.1   NEU_E4.2   NEU_E4.3
## 729737     1.229206  1.142812  0.8316146  1.502754  0.3241103  0.7077164
## LINC01128 41.301326 55.426388 57.9754153 66.443200 65.9024298 61.8072295
## LINC02593  6.146031  3.428436  2.3760416  3.756885  3.1330663  3.5385818
## SAMD11    93.050905 28.798866 26.9680723 34.241326 24.4163100 24.4162147
## NOC2L     82.725573 92.224939 93.1408311 80.719364 81.4597247 81.7412406
## KLHL17    13.890029 13.485183 11.4049997 10.411939  9.5072358  8.2566910
#Scale each row
heatmap_matrix <- t(scale(t(heatmap_matrix)))
if(min(heatmap_matrix) == 0){
    heatmap_col = colorRamp2(c( 0, max(heatmap_matrix)), 
                      c( "white", "red"))
  } else {
    heatmap_col = colorRamp2(c(min(heatmap_matrix), 0,
        max(heatmap_matrix)), c("blue", "white", "red"))
  }
current_heatmap <- Heatmap(name = "Gene Count Heatmap",
                           as.matrix(heatmap_matrix),
      show_row_dend = TRUE,show_column_dend = TRUE, 
      col=heatmap_col,show_column_names = TRUE, 
      show_row_names = FALSE,show_heatmap_legend = TRUE)
current_heatmap

Use Limma package to see our data clustering colored by cell type since cell type is the only control in my data set.

limma::plotMDS(main = "Multidimensional Scaling Plot",
               heatmap_matrix,
               col = c(rep("darkgreen",3), rep("blue",3)))

creates a design matrix

samples <- data.frame(cell_type = c(rep("NEU_E3", 3), rep("NEU_E4", 3)))
rownames(samples) <- colnames(normalized_count_data)[2:7]

model_design <- model.matrix(~ samples$cell_type )
kable(model_design[1:6,], type="html")
(Intercept) samples$cell_typeNEU_E4
1 0
1 0
1 0
1 1
1 1
1 1

Create our data matrix

expressionMatrix <- as.matrix(normalized_count_data[,2:7])
rownames(expressionMatrix) <- normalized_count_data$GeneID
colnames(expressionMatrix) <- colnames(normalized_count_data)[2:7]
minimalSet <- ExpressionSet(assayData=expressionMatrix)

#Fit our data to the above model
fit <- lmFit(minimalSet, model_design)

Apply empirical Bayes to compute differential expression for the above described model. We use Benjamini & Hochberg model as multiple hypothesis correction method, since it is the most commonly used and works very well in our dataset.

fit2 <- eBayes(fit, trend = TRUE)

topfit <- topTable(fit2, 
                   coef=ncol(model_design),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(normalized_count_data[,1:2],
                     topfit,
                     by.y=0,by.x=1,
                     all.y=TRUE)
output_hits <- output_hits[,-2]
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
head(output_hits)
##          GeneID      logFC    AveExpr         t      P.Value  adj.P.Val
## 25    100287846  13.803584   6.901792 15.459182 1.163843e-06 0.00849946
## 9680      RRAGB  82.255531  42.653304 15.265943 1.267914e-06 0.00849946
## 3623       ERC1 201.661568 313.959011 12.529843 4.825670e-06 0.02156592
## 7932     PCDHA6   9.085897   6.765306 10.808858 1.295026e-05 0.04340602
## 10220   SLC16A2 139.686820 212.644979  9.994541 2.173238e-05 0.05146164
## 7222    NDUFA10 -56.200447  90.237582 -9.702875 2.640008e-05 0.05146164
##               B
## 25    -2.594330
## 9680  -2.596579
## 3623  -2.639545
## 7932  -2.683062
## 10220 -2.710972
## 7222  -2.722503

Number of gene pass the threshold p-value < 0.05. We use p-value less that 0.05 by convention. Also the number of genes that are better than this p-value is resonable.

length(which(output_hits$P.Value < 0.05))
## [1] 680

Number of gene pass correction?

length(which(output_hits$adj.P.Val < 0.05))
## [1] 4

Plot the result and gene APOE

simple_model_pvalues <- data.frame(GeneID = output_hits$GeneID, simple_pvalue=output_hits$P.Value)
simple_model_pvalues$colour <- "gray"
simple_model_pvalues$colour[simple_model_pvalues$simple_pvalue < 0.05] <- "orange"
simple_model_pvalues$colour[simple_model_pvalues$GeneID == "APOE"] <- "red"
plot(simple_model_pvalues$simple_pvalue,
     col = simple_model_pvalues$colour,
     xlab = "simple model p-values",
     main="Simple Limma MA Plot")
points(which(simple_model_pvalues$GeneID == "APOE"),                                                 simple_model_pvalues[which(simple_model_pvalues$GeneID == "APOE"),2], pch=20,                 
       col="red", cex=1.5)
legend(0,1,legend=c("Significant","Insif", "APOE"),
       fill=c("orange","grey", "red"),cex = 0.7)

#This plot is only one dimensional because there is only one group of control versus one group of test. Thus we only have a simple model with one tyoe of grouping.
#The APOE's position on  the graph makes sense because APOE genes should not be expressed very differently. Instead APOE gene variant  will induce other genes express differently, which is the purpose of our study. 

Plot the heat map for only significant p-value genes.

top_hits <- output_hits$GeneID[output_hits$P.Value<0.05]

heatmap_matrix_tophits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in%                                    top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
    heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)), 
                             c( "white", "red"))
  } else {
    heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
      max(heatmap_matrix_tophits)), c("blue", "white", "red"))
  }
current_heatmap <- Heatmap(name = "Differentialized P-Value Heapmap",
                           as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = TRUE,
                               show_row_dend = TRUE,
                               show_column_dend = TRUE, 
                               col=heatmap_col,
                               show_column_names = TRUE, 
                               show_row_names = FALSE,
                               show_heatmap_legend = TRUE,
                               )
current_heatmap

#The result looks very promising with controls and test grouped together because controls have the same APOE3 gene and test have the same APOE4 genes.   

Limma package has already given a reasonable and informative result. I will not further dilapidated use EdgeR package since Limma and EdgeR are similar in the core and written by the same authors.

Thresholded over-representation analysis

How many genes are up regulated.

length(which(output_hits$P.Value < 0.05 
             & output_hits$logFC > 0))
## [1] 499

How many genes are down regulated?

length(which(output_hits$P.Value < 0.05 
             & output_hits$logFC < 0))
## [1] 181

Create thresholed lists of genes

output_hit_with_rank <- output_hits
output_hit_with_rank[,"rank"] <- -log(output_hit_with_rank$P.Value, base = 10) * sign(output_hit_with_rank$logFC)
output_hit_with_rank <- output_hit_with_rank[order(output_hit_with_rank$rank), ]
                                                       
upregulated_genes <- output_hits$GeneID[
  which(output_hits$P.Value < 0.05 
             & output_hits$logFC > 0)]
downregulated_genes <- output_hits$GeneID[
  which(output_hits$P.Value < 0.05 
             & output_hits$logFC < 0)]
write.table(x=upregulated_genes,
            file=file.path("data","upregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
            file=file.path("data","downregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

write.table(x=data.frame(genename= output_hit_with_rank$GeneID,F_stat= output_hit_with_rank$rank),
            file=file.path("data","ranked_genelist.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

Use G:Profiler to perform gene enrichment analysis becaise it has both web interface and R package that are very handy. GO:BP (2022-12-04), Reactome(2022-12-28), and WikiPathways(2022-12-10) for annotation because they are comprehensive and cover most human biological pathways.

Upregulated genes

GEA_upragulated <- gprofiler2::gost(query = upregulated_genes, 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  sources = c("GO:BP", "REAC", "WP"))

#we only want the term size <200 
GEA_upragulated_top <- data.frame(
  term_name = GEA_upragulated$result$term_name[GEA_upragulated$result$term_size < 200 &
                                               GEA_upragulated$result$term_size > 1],
  term_id = GEA_upragulated$result$term_id[GEA_upragulated$result$term_size < 200 &
                                           GEA_upragulated$result$term_size > 1],
  source = GEA_upragulated$result$source[GEA_upragulated$result$term_size < 200 &
                                         GEA_upragulated$result$term_size > 1]
)

knitr::kable(head(GEA_upragulated_top, 10), format = "html")
term_name term_id source
neuron migration GO:0001764 GO:BP
axon guidance GO:0007411 GO:BP
neuron projection guidance GO:0097485 GO:BP
synapse assembly GO:0007416 GO:BP
action potential GO:0001508 GO:BP
regulation of microtubule depolymerization GO:0031114 GO:BP
central nervous system neuron differentiation GO:0021953 GO:BP
regulation of microtubule cytoskeleton organization GO:0070507 GO:BP
regulation of microtubule polymerization or depolymerization GO:0031110 GO:BP
membrane depolarization during action potential GO:0086010 GO:BP
length(GEA_upragulated_top$term_name)
## [1] 82
# 82 genesets are returned for upregulated genes with 1 < term size < 200, and p value < 0.05 

Plot visualization

gprofiler2::gostplot(GEA_upragulated) %>% plotly::layout(title = "Upregulated genes plot", font = list(size = 10))

Downregulated genes

GEA_downragulated <- gprofiler2::gost(query = downregulated_genes, 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  sources = c("GO:BP", "REAC", "WP"))

#we only want the term size <200 
GEA_downragulated_top <- data.frame(
  term_name = GEA_downragulated$result$term_name[GEA_downragulated$result$term_size < 200 &
                                               GEA_downragulated$result$term_size > 1],
  term_id = GEA_downragulated$result$term_id[GEA_downragulated$result$term_size < 200 &
                                           GEA_downragulated$result$term_size > 1],
  source = GEA_downragulated$result$source[GEA_downragulated$result$term_size < 200 &
                                         GEA_downragulated$result$term_size > 1]
)

knitr::kable(head(GEA_downragulated_top, 10), format = "html")
term_name term_id source
Response of EIF2AK1 (HRI) to heme deficiency REAC:R-HSA-9648895 REAC
2q37 copy number variation syndrome WP:WP5224 WP
Photodynamic therapy-induced unfolded protein response WP:WP3613 WP
length(GEA_downragulated_top$term_name)
## [1] 3
# 3 genesets are returned for downregulated genes with 1 < term size < 200, and p value < 0.05 

Plot visualization

gprofiler2::gostplot(GEA_downragulated) %>% plotly::layout(title = "Downregulated genes plot", font = list(size = 10))

All genes

all_genes <- gprofiler2::gost(query = output_hits$GeneID, 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  sources = c("GO:BP", "REAC", "WP"))

#we only want the term size <200 
all_genes_top <- data.frame(
  term_name = all_genes$result$term_name[all_genes$result$term_size < 200 &
                                               all_genes$result$term_size > 1],
  term_id = all_genes$result$term_id[all_genes$result$term_size < 200 &
                                           all_genes$result$term_size > 1],
  source = all_genes$result$source[all_genes$result$term_size < 200 &
                                         all_genes$result$term_size > 1]
)

knitr::kable(head(all_genes_top, 10), format = "html")
term_name term_id source
rRNA processing GO:0006364 GO:BP
ribonucleoprotein complex assembly GO:0022618 GO:BP
regulation of mRNA processing GO:0050684 GO:BP
mitochondrial gene expression GO:0140053 GO:BP
cytoplasmic translation GO:0002181 GO:BP
regulation of RNA splicing GO:0043484 GO:BP
neuron projection extension GO:1990138 GO:BP
regulation of macroautophagy GO:0016241 GO:BP
vacuolar transport GO:0007034 GO:BP
mitochondrial translation GO:0032543 GO:BP
length(all_genes_top$term_name)
## [1] 1423
# 1423 genesets are returned for all genes with 1 < term size < 200, and p value < 0.05 

Plot visualization

gprofiler2::gostplot(all_genes) %>% plotly::layout(title = "All Genes genes plot", font = list(size = 10))

From the above three over-representations analysis, we are able to see that there are around 30 times more upregulated gene sets than down regulated gene sets, which suggests that the upregulated genes are more actively involved in biological pathways than downregulated genes. The overall gene data sets suepasses both up regulated and down regulated gene sets significantly suggests that the genes other than up regulated and down regulated also plays a crutial part in the overall pathways.

Interpretation

  1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

The over-representation results amazingly support the mechanism discussed in the original paper. In the over-representation analysis above, we can see that up regulated genes are heavily involved in neural pathways, in which the top ten results are all related to neural transmission and neuron development. In the paper, the author concluded that the different variants of APEO gene will have an overall effete of neuron gene expression level, which thus cause Alzheimer’s disease.

  1. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

In the paper, the result showed that “APOE4 neurons exhibited increased synapse number and elevated Ab42 secretion relative to isogenic APOE3 cells” which perfectly supports our result. Because the upregulated genes are the genes that are significantly more in APOE4 neurons than APOE3 neurons. These upregulated genes are heavily involve in the synapse development pathways which is exactly what the papers illustrated.

Reference

Chen, Y., A. T. L. Lun, and G. K. Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.”
Davis, S., and P. Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 14: 1846–47.
Durinck, S., Y. Moreau, A. Kasprzyk, S. Davis, B. Moor, A. Brazma, and W. Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21: 3439–40.
Durinck, S., P. Spellman, E. Birney, and W. Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the r/Bioconductor Package biomaRt.” Nature Protocols 4: 1184–91.
Gu, Z., R. Eils, and M. Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313.
Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21.
Kolberg, L., U. Raudvere, I. Kuzmin, J. Vilo, and H. Peterson. 2020. “Gprofiler2– an r Package for Gene List Functional Enrichment Analysis and Namespace Conversion Toolset g:profiler.”
Law, C. W., M. Alhamdoosh, S. Su, X. Dong, L. Tian, G. K. Smyth, and M. E. Ritchie. 2016. “RNA-Seq Analysis Is Easy as 1-2-3 with Limma, Glimma and edgeR.” F1000Research 5. https://doi.org/10.12688/f1000research.9005.3.
Lin, Yuan-Ta, Jinsoo Seo, Fan Gao, Heather M Feldman, Hsin-Lan Wen, Jay Penney, Hugh P Cam, et al. 2018. “Apoe4 Causes Widespread Molecular and Cellular Alterations Associated with Alzheimer’s Disease Phenotypes in Human iPSC-Derived Brain Cell Types.” Neuron 98 (6): 1141–54.
McCarthy, D. J., Chen Y, and G. K. Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40: 4288–97.
Morgan, Martin. 2021. “BiocManager: Access the Bioconductor Project Package Repository.”
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, M. D., McCarthy DJ, and G. K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26: 139–40.
Sanghi, A., J. J. Gruber, and A. Metwally. 2021. “Chromatin Accessibility Associates with Protein-RNA Correlation in Human Cancer.” Nat Commun 12: 5732. https://doi.org/10.1038/s41467-021-25872-1.